DiLBERT: Cheap Embeddings for Disease Related Medical NLP
Authors
Abstract
Electronic Health Records include health-related information, among which there is text mentioning health conditions and diagnoses. Usually, this information is also coded using appropriate terminologies and classifications. The act of coding is time consuming and prone to mistakes. Consequently, there is an increasing demand for clinical text mining tools that help with coding. In the last few years, Natural Language Processing (NLP) models have been shown to be effective in sentence-level tasks. Taking advantage of the transfer learning capabilities of those models, a number of biomedicine-specific models have been developed. However, biomedical models can be seen as too general for some areas, like diagnostic expressions. In this paper, we describe a BERT model specialized on tasks related to diagnoses and conditions. To obtain a disease-related language model, we created pre-training corpora starting from ICD-11 entities, and enriched them with documents selected by querying PubMed and Wikipedia for entity names. Fine-tuning was carried out towards three downstream tasks on two different datasets. Results show that our model, besides being trained on much smaller corpora than state-of-the-art algorithms, leads to comparable or higher accuracy scores on all considered tasks, in particular 97.53% for death certificate coding and 81.32% for document classification, which are both slightly higher than other models. To summarize the practical implications of this work, a model pre-trained and fine-tuned on small domain-specific corpora can achieve better performance. This approach may simplify the development of models for languages other than English, due to the minor quantity of data needed for training.
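The corpus-enrichment step described in the abstract (selecting PubMed documents by querying for ICD-11 entity names) could be sketched as below. This is a minimal illustration only: the entity names and the choice of the NCBI E-utilities `esearch` endpoint are assumptions for the example, not the authors' exact pipeline.

```python
from urllib.parse import urlencode

# NCBI E-utilities search endpoint for PubMed (public API; usage shown here
# is a simplified sketch of how entity names could be turned into queries).
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def pubmed_query_url(entity_name: str, retmax: int = 100) -> str:
    """Build a PubMed search URL for one disease entity name."""
    params = {"db": "pubmed", "term": entity_name,
              "retmax": retmax, "retmode": "json"}
    return f"{ESEARCH}?{urlencode(params)}"

# Hypothetical ICD-11 entity names used as query terms.
entities = ["cholera", "dengue fever", "acute myocardial infarction"]
urls = [pubmed_query_url(e) for e in entities]
```

The returned article identifiers would then be used to fetch abstracts that, together with Wikipedia pages for the same entity names, form the disease-related pre-training corpus.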
Similar resources
Improving Word Embeddings for NLP
Word embeddings are an important technique in natural language processing, and have been shown to significantly outperform previous methods. Word embeddings such as word2vec also exhibit interesting semantic properties, such that words with similar meaning lie close together in embedding space. The directions in embedding spaces can also correspond to semantic features, such that one can perfor...
How to Train Good Word Embeddings for Biomedical NLP
The quality of word embeddings depends on the input corpora, model architectures, and hyper-parameter settings. Using the state-of-the-art neural embedding tool word2vec and both intrinsic and extrinsic evaluations, we present a comprehensive study of how the quality of embeddings changes according to these features. Apart from identifying the most influential hyper-parameters, we also observe ...
Hierarchical Semantic Structures for Medical NLP
We present a framework for building a medical natural language processing (NLP) system capable of deep understanding of clinical text reports. The framework helps developers understand how various NLP-related efforts and knowledge sources can be integrated. The aspects considered include: 1) computational issues dealing with defining layers of intermediate semantic structures to reduce the dime...
Substitute Based SCODE Word Embeddings in Supervised NLP Tasks
We analyze a word embedding method in supervised tasks. It maps words on a sphere such that words co-occurring in similar contexts lie closely. The similarity of contexts is measured by the distribution of substitutes that can fill them. We compared word embeddings, including more recent representations (Huang et al.2012; Mikolov et al.2013), in Named Entity Recognition (NER), Chunking, and Dep...
The emergent algebraic structure of RNNs and embeddings in NLP
We examine the algebraic and geometric properties of a uni-directional GRU and word embeddings trained end-to-end on a text classification task. A hyperparameter search over word embedding dimension, GRU hidden dimension, and a linear combination of the GRU outputs is performed. We conclude that words naturally embed themselves in a Lie group and that RNNs form a nonlinear representation of the...
Journal
Journal title: IEEE Access
Year: 2021
ISSN: 2169-3536
DOI: https://doi.org/10.1109/access.2021.3131386